Abstract

This paper describes automatic techniques for mapping 9611 entries in a database of English verbs to WordNet senses. The verbs were initially grouped into 491 classes based on syntactic categories. Mapping these classified verbs into WordNet senses provides a resource that may be used for disambiguation in multilingual applications such as machine translation and cross-language information retrieval. Our techniques make use of (1) a training set of 1791 disambiguated entries, representing 1442 verb entries from 167 of the categories; (2) word sense probabilities based on frequency counts in a previously tagged corpus; (3) semantic similarity of WordNet senses for verbs within the same class; (4) probabilistic correlations between WordNet data and attributes of the verb classes. The best results achieved 72% precision and 58% recall, versus a lower bound of 62% precision and 38% recall for assigning the most frequently occurring WordNet sense, and an upper bound of 87% precision and 75% recall for human judgment.

1  Introduction 

Our goal is to map entries in a lexical database of 4076 English verbs automatically to WordNet senses (Miller and Fellbaum, 1991), (Fellbaum, 1998) to support such applications as machine translation and cross-language information retrieval. For example, the verb drop is multiply ambiguous, with many potential translations in Spanish: bajar, caerse, dejar, caer, derribar, disminuir, echar, hundir, soltar, etc. The lexical database specifies a set of interpretations for the verb drop, depending on its context in the source language (SL). Inclusion of WordNet senses in the lexical database enables the selection of an appropriate verb in the target language (TL). Final selection is based on a frequency count of WordNet senses across all semantic classes to which the verb belongs; e.g., disminuir is selected when the WordNet sense corresponds to the meaning of drop in Prices dropped.

Our task differs from prototypical word sense disambiguation (WSD) in several ways. First, the words to be disambiguated are not tokens in a text corpus, but entries in a lexical database. Second, we take an “all-words” approach rather than a “lexical-sample” approach (Kilgarriff and Rosenzweig, 2000): All words in the lexical database “text” are disambiguated, not just a small number for which detailed knowledge is available. Third, we replace the contextual data typically used for WSD with information about verb senses encoded in terms of thematic grids and lexical-semantic representations from (Olsen et al., 1997). Fourth, whereas it is often assumed that only one word sense is accurate for each token in a text corpus, the absence of sentential context leads to a situation where several WordNet senses may be equally appropriate for a database entry. Indeed, since distinctions between WordNet senses are often fine-grained (Palmer, 2000), it can be unclear, even in context, which of several senses is invoked or even if only one sense is invoked.

The verb database contains mostly syntactic information about its entries, much of which applies at the class level within the database. WordNet, on the other hand, is a significant source for information about semantic relationships, much of which applies at the “synset” level (“synsets” are WordNet’s groupings of synonymous word senses). Mapping entries in the database to their corresponding WordNet senses greatly extends the semantic potential of the database.

2  Lexical  Resources 

We use an existing classification of 4076 English verbs, based initially on English Verb Classes and Alternations (Levin, 1993) and extended through the splitting of some classes into subclasses and the addition of new classes. The resulting 491 classes (e.g., “Roll Verbs, Group I”, which includes drift, drop, glide, roll, swing) are referred to here as Levin+ classes. As verbs may be assigned to multiple Levin+ classes, the actual number of entries in the database is larger, 9611.

Following the model of (Dorr and Olsen, 1997), each Levin+ class is associated with a thematic grid (henceforth abbreviated θ-grid), which summarizes a verb’s syntactic behavior by specifying its predicate argument structure. For example, the Levin+ class “Roll Verbs, Group I” is associated with the θ-grid [th goal], in which a theme and a goal are used (e.g., The ball dropped to the ground).¹ Each θ-grid specification corresponds to a Grid class. There are 48 Grid classes, with a one-to-many relationship between Grid and Levin+ classes.

WordNet, the lexical resource to which we are mapping entries from the lexical database, groups synonymous word senses into “synsets” and structures the synsets into part-of-speech hierarchies. Our mapping operation uses several other data elements pertaining to WordNet: semantic relationships between synsets, frequency data, and syntactic information.

¹There is also a Levin+ class “Roll Verbs, Group II” which is associated with the θ-grid [th particle(down)], in which a theme and a particle ‘down’ are used (e.g., The ball dropped down).


Seven semantic relationship types exist between synsets, including, for example, antonymy, hyperonymy, and entailment. Synsets are often related to a half dozen or more other synsets; they may be related to multiple synsets through a single relationship or may be related to a single synset through multiple relationship types.

Our frequency data for WordNet senses is derived from SEMCOR, a semantic concordance incorporating tagging of the Brown corpus with WordNet senses.²

Syntactic patterns (“frames”) are associated with each synset, e.g., Somebody ----s something; Something ----s; Somebody ----s somebody into V-ing something. There are 35 such verb frames in WordNet and a synset may have only one or as many as a half dozen or so frames assigned to it.

Our mapping of verbs in Levin+ classes to WordNet senses relies in part on the relation between thematic roles in Levin+ and verb frames in WordNet. Both reflect how many and what kinds of arguments a verb may take. However, constructing a direct mapping between θ-grids and WordNet frames is not possible, since the underlying classifications differ in significant ways. The correlations between the two sets of data are instead viewed probabilistically, as described in Section 3.

Table 1 illustrates the relation between each of the resources above for the verb drop. In our multilingual applications (e.g., lexical selection in machine translation), the Grid information provides a context-based means of associating a verb with a Levin+ class according to its usage in the SL sentence. The WordNet sense possibilities are thus pared down during SL analysis, but not sufficiently for the final selection of a TL verb. For example, Levin+ class 9.4 has three possible WordNet senses for drop. However, the WordNet sense 8 is not associated with any of the other classes; thus, it is considered to have a higher “information content” than the others. The upshot is that the lexical-selection routine prefers dejar caer over other translations such as derribar and bajar.³

²For further information see the WordNet manuals, section 7, SEMCOR at http://www.cogsci.princeton.edu.

³This lexical-selection approach is an adaptation of the notion of reduction in entropy, measured by information gain (Mitchell, 1997). Using information content to quantify the “value” of a class in the WordNet hierarchy has


Levin+                             Grid/Example                                WN Sense                     Spanish Verb(s)
9.4 Directional Put                [ag th mod-loc src goal]                    1. move, displace            1. derribar, echar
                                   I dropped the stone                         2. descend, fall, go down    2. bajar, caerse
                                                                               8. drop, set down, put down  8. dejar caer, echar, soltar
45.6 Calibratable Change of State  [th]                                        1. move, displace            1. derribar, echar
                                   Prices dropped                              3. decline, go down, wane    3. disminuir
47.7 Meander                       [th src goal]                               2. descend, fall, go down    2. bajar, caerse
                                   The river dropped from the lake to the sea  4. sink, drop, drop down     4. hundir, caer
51.3.1 Roll I                      [th goal]                                   2. descend, fall, go down    2. bajar, caerse
                                   The ball dropped to the ground
51.3.1 Roll II                     [th particle(down)]                         2. descend, fall, go down    2. bajar, caerse
                                   The ball dropped down

Table 1: Relation Between Levin+ and WN Senses for ‘drop’


The other classes are similarly associated with appropriate TL verbs during lexical selection: disminuir (class 45.6), hundir (class 47.7), and bajar (class 51.3.1).⁴

3  Training  Data 

We began with the lexical database of (Dorr and Jones, 1996), which contains a significant number of WordNet-tagged verb entries. Some of the assignments were in doubt, since class splitting had occurred subsequent to those assignments, with all old WordNet senses carried over to new subclasses. New classes had also been added since the manual tagging. It was determined that the tagging for only 1791 entries (including 1442 verbs in 167 classes) could be considered stable; for these entries, 2756 assignments of WordNet senses had been made. Data for these entries, taken from both WordNet and the verb lexicon, constitute the training data for this study.

The  following  probabilities  were  generated 
from  the  training  data: 

also been used for measuring semantic similarity in a taxonomy (Resnik, 1999b). More recently, context-based models of disambiguation have been shown to represent significant improvements over the baseline (Bangalore and Rambow, 2000), (Ratnaparkhi, 2000).

⁴The full set of Spanish translations is selected from WordNet associations developed in the EuroWordNet effort (Dorr et al., 1997).


• Grid probability$_x$ = $\frac{|rel_x \wedge (G_1 = G_2)|}{|rel_x|}$, where each occurrence of $rel_x$ involves relating synset $s_1$ through relationship type $x$ to another synset $s_2$, and where $s_1$ is mapped to by a verb in Grid class $G_1$ and $s_2$ is mapped to by a verb in Grid class $G_2$. This is the probability that if one synset is related to another through a particular relationship type, then a verb mapped to the first synset will belong to the same Grid class as a verb mapped to the second synset. Computed values generally range between .3 and .35.

• Levin+ probability$_x$ = $\frac{|rel_x \wedge (L{+}_1 = L{+}_2)|}{|rel_x|}$, where $rel_x$ is as above, except that $s_1$ is mapped to by a verb in Levin+ class $L{+}_1$ and $s_2$ is mapped to by a verb in Levin+ class $L{+}_2$. This is the probability that if one synset is related to another through a particular relationship type, then a verb mapped to the first synset will belong to the same Levin+ class as a verb mapped to the second synset. Computed values generally range between .25 and .3.

• Tot frame probability$_{i,j}$ = $\frac{|\theta_{i,v} \wedge cf_{j,v}|}{|\theta_{i,v}|}$, where $\theta_{i,v}$ is the occurrence of the entire θ-grid $i$ for verb entry $v$ and $cf_{j,v}$ is the occurrence of the entire frame sequence $j$ for a WordNet sense to which verb entry $v$ is mapped. This is the probability that a verb in a Levin+ class is mapped to a WordNet verb sense with some specific combination of frames. Values average only .11, but in some cases the probability is 1.0.

• Indv frame probability$_{i,j}$ = $\frac{|\theta_{i,v} \wedge cf_{j,v}|}{|\theta_{i,v}|}$, where $\theta_{i,v}$ is the occurrence of the single θ-grid component $i$ for verb entry $v$ and $cf_{j,v}$ is the occurrence of the single frame $j$ for a WordNet sense to which verb entry $v$ is mapped. This is the probability that a verb in a Levin+ class with a particular θ-grid component (possibly among others) is mapped to a WordNet verb sense assigned a specific frame (possibly among others). Values average .20, but in some cases the probability is 1.0.

• Prior WN probability$_s$ = $\frac{|t_s|}{|t_v|}$, where $t_s$ is an occurrence of tag $s$ (for a particular synset) in SEMCOR and $t_v$ is an occurrence of any of a set of tags for verb $v$ in SEMCOR, with $s$ being one of the senses possible for verb $v$. This probability is the prior probability of specific WordNet verb senses. Values average .11, but in some cases the probability is 1.0.
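The probabilities listed above are all ratios of occurrence counts. A minimal sketch of how two of them might be estimated follows; the function names, the data layout, and the toy counts are illustrative assumptions, not the authors' implementation or real SEMCOR figures.

```python
from collections import Counter


def prior_wn_probability(tag_counts, verb_senses):
    """Prior probability of each sense of one verb.

    tag_counts:  Counter mapping a sense id to its number of tagged
                 occurrences in a sense-tagged corpus (hypothetical layout).
    verb_senses: the sense ids that are possible for the verb.
    """
    total = sum(tag_counts[s] for s in verb_senses)
    if total == 0:
        # The verb never occurs in the tagged corpus: no evidence either way.
        return {s: 0.0 for s in verb_senses}
    return {s: tag_counts[s] / total for s in verb_senses}


def same_class_probability(relation_pairs):
    """Share of relation occurrences whose two verbs fall in the same class.

    relation_pairs: one (class_of_verb_at_s1, class_of_verb_at_s2) tuple per
                    occurrence of a single relationship type. Passing Grid
                    classes gives a Grid-style probability; passing Levin+
                    classes gives a Levin+-style probability.
    """
    if not relation_pairs:
        return 0.0
    same = sum(1 for c1, c2 in relation_pairs if c1 == c2)
    return same / len(relation_pairs)


# Toy counts for three senses of one verb (made-up numbers).
counts = Counter({"drop#1": 6, "drop#2": 3, "drop#3": 1})
print(prior_wn_probability(counts, ["drop#1", "drop#2", "drop#3"]))
print(same_class_probability([("A", "A"), ("A", "B"), ("B", "B")]))
```

The zero-count branch reflects the data sparsity the paper mentions: a measure with no evidence returns zero rather than raising an error.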

In addition to the foregoing data elements, based on the training set, we also made use of a semantic similarity measure, which reflects the confidence with which a verb, given the total set of verbs assigned to its Levin+ class, is mapped to a specific WordNet sense. This represents an implementation of a class disambiguation algorithm (Resnik, 1999a), modified to run against the WordNet verb hierarchy.⁵

We also made a powerful “same-synset assumption”: If (1) two verbs are assigned to the same Levin+ class, (2) one of the verbs $v_1$ has been mapped to a specific WordNet sense $s_1$, and (3) the other verb $v_2$ has a WordNet sense $s_2$ synonymous with $s_1$, then $v_2$ should be mapped to $s_2$. Since WordNet groups synonymous word senses into “synsets,” $s_1$ and $s_2$ would correspond to the same synset. Since Levin+ verbs are mapped to WordNet senses via their corresponding synset identifiers, when the set of conditions enumerated above is met, the two verb entries would be mapped to the same WordNet synset.

⁵The assumption underlying this measure is that the appropriate word senses for a group of semantically related words should themselves be semantically related. Given WordNet’s hierarchical structure, the semantic similarity between two WordNet senses corresponds to the degree of informativeness of the most specific concept that subsumes them both.


As an example, the two verbs tag and mark have been assigned to the same Levin+ class. In WordNet, each occurs in five synsets, in only one of which they both occur. If tag has a WordNet synset assigned to it for the Levin+ class it shares with mark, and it is the synset that covers senses of both tag and mark, we can safely assume that that synset is also appropriate for mark, since in that context, the two verb senses are synonymous.
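The same-synset assumption can be sketched as a propagation step over the entries of each class. The data structures, the class identifier "C", and the synset identifiers below are hypothetical; only the tag/mark configuration (five synsets each, one shared) follows the example in the text.

```python
def enforce_same_synset(class_members, verb_synsets, assigned):
    """Generate the assignments implied by the same-synset assumption.

    class_members: dict class id -> list of verbs in that class
    verb_synsets:  dict verb -> set of synset ids containing a sense of it
    assigned:      dict (class id, verb) -> set of synset ids already assigned
    Returns a dict (class id, verb) -> set of newly generated synset ids.
    """
    generated = {}
    for class_id, verbs in class_members.items():
        # Every synset already assigned to some verb of this class.
        class_synsets = set()
        for verb in verbs:
            class_synsets |= assigned.get((class_id, verb), set())
        for verb in verbs:
            already = assigned.get((class_id, verb), set())
            # A synset is propagated only if the verb itself has a sense in it.
            new = (class_synsets & verb_synsets.get(verb, set())) - already
            if new:
                generated[(class_id, verb)] = new
    return generated


class_members = {"C": ["tag", "mark"]}
verb_synsets = {
    "tag": {"t1", "t2", "t3", "t4", "shared"},
    "mark": {"m1", "m2", "m3", "m4", "shared"},
}
assigned = {("C", "tag"): {"shared"}}
print(enforce_same_synset(class_members, verb_synsets, assigned))
# prints {('C', 'mark'): {'shared'}}
```

Returning only the newly generated assignments, rather than mutating the input, keeps them available for review before they are merged into the database.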

4  Evaluation 

Subsequent to the culling of the training set, several processes were undertaken that resulted in full mapping of entries in the lexical database to WordNet senses. Much, but not all, of this mapping was accomplished manually.

Each entry whose WordNet senses were assigned manually was considered by at least two coders, one coder who was involved in the entire manual assignment process and the other drawn from a handful of coders working independently on different subsets of the verb lexicon. In the manual tagging, if a WordNet sense was considered appropriate for a lexical entry by any one of the coders, it was assigned. Overall, 13452 WordNet sense assignments were made. Of these, 51% were agreed upon by multiple coders. The kappa coefficient (K) of intercoder agreement was .47 for a first round of manual tagging and (only) .24 for a second round of more problematic cases.⁶

While the full tagging of the lexical database may make the automatic tagging task appear superfluous, the low rate of agreement between coders and the automatic nature of some of the tagging suggest there is still room for adjustment of WordNet sense assignments in the verb database. On the one hand, even the higher of the kappa coefficients mentioned above is significantly lower than the standard suggested for good reliability (K > .8) or even the level where tentative conclusions may be drawn (.67 < K < .8) (Carletta, 1996), (Krippendorff, 1980). On the other hand, if the automatic assignments agree with human coding at levels comparable to the degree of agreement among humans, they may be used to identify current assignments that need review and to suggest new assignments for consideration.

⁶The kappa statistic measures the degree to which pairwise agreement of coders on a classification task surpasses what would be expected by chance; the standard definition of this coefficient is: K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the actual percentage of agreement and P(E) is the expected percentage of agreement, averaged over all pairs of assignments. Several adjustments in the computation of the kappa coefficient were made necessary by the possible assignment of multiple senses for each verb in a Levin+ class, since without prior knowledge of how many senses are to be assigned, there is no basis on which to compute P(E).
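For reference, the standard definition K = (P(A) - P(E)) / (1 - P(E)) can be sketched for the simple two-coder case. This is the textbook form, with chance agreement taken from each coder's label frequencies; it does not reproduce the adjustments the authors made for multiple senses per entry, which the paper does not spell out. The labels below are toy data.

```python
def kappa(coder_a, coder_b):
    """Two-coder kappa over parallel lists of labels.

    P(A) is the observed share of items on which the coders agree;
    P(E) is the agreement expected by chance, computed from each
    coder's own label frequencies.
    """
    n = len(coder_a)
    if n == 0 or n != len(coder_b):
        raise ValueError("both coders must label the same non-empty item list")
    p_a = sum(1 for x, y in zip(coder_a, coder_b) if x == y) / n
    labels = set(coder_a) | set(coder_b)
    p_e = sum((coder_a.count(l) / n) * (coder_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        # Both coders used one identical label throughout: agreement is total.
        return 1.0
    return (p_a - p_e) / (1 - p_e)


# 1 = "sense applies", 0 = "sense does not apply", for ten candidate senses.
a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
b = [1, 0, 0, 0, 1, 1, 1, 1, 0, 1]
print(round(kappa(a, b), 2))  # prints 0.4
```

In the toy data the coders agree on 7 of 10 items, but half of that agreement is expected by chance, which is why the coefficient is only 0.4.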

In addition, there are consistency checks that can be made more easily by the automatic process than by hand. For example, the same-synset assumption is much more easily enforced automatically than manually. When this assumption is implemented for the 2756 senses in the training set, another 967 sense assignments are generated, only 131 of which were actually assigned manually. Similarly, when such a premise is enforced on the entirety of the lexical database of 13452 assignments, another 5059 sense assignments are generated. If the same-synset assumption is valid and if the senses assigned in the database are accurate, then the human tagging has a recall of no more than 73%.

Because a word sense was assigned even if only one coder judged it to apply, human coding has been treated as having a precision of 100%. However, some of the solo judgments are likely to have been in error. To determine what proportion of such judgments were in reality precision failures, a random sample of 50 WordNet senses supported by only one of the two original coders was investigated further by a team of three judges. In this round, judges rated the WordNet senses assigned to the verb entries as falling into one of three categories: definitely correct, definitely incorrect, and arguable whether correct. As it turned out, if any one of the judges rated a sense definitely correct, another judge independently judged it definitely correct; this accounts for 31 instances. In 13 instances the assignments were judged definitely incorrect by at least two of the judges. No consensus was reached on the remaining 6 instances. Extrapolating from this sample to the full set of judgments in the database supported by only one coder leads to an estimate that approximately 1725 (26% of 6636 solo judgments) of those senses are incorrect. This suggests that the precision of the human coding is approximately 87%.
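The human-coding estimates can be checked with a few lines of arithmetic using only the figures given in the text. The recall formula is one reading of "no more than 73%": it assumes that every assigned sense is correct and that all 5059 senses generated by the same-synset assumption are correct senses the coders missed. The function name is illustrative.

```python
def human_coding_bounds(total_assignments, solo_judgments,
                        sample_size, sample_incorrect,
                        generated_by_same_synset):
    """Return (estimated incorrect senses, precision, recall ceiling)."""
    # Share of sampled solo judgments rated definitely incorrect: 13/50 = 26%.
    error_rate = sample_incorrect / sample_size
    # Extrapolate that rate to all solo judgments in the database.
    estimated_incorrect = error_rate * solo_judgments
    precision = (total_assignments - estimated_incorrect) / total_assignments
    # Assumed reading: assigned senses over assigned plus missed senses.
    recall_ceiling = total_assignments / (total_assignments + generated_by_same_synset)
    return estimated_incorrect, precision, recall_ceiling


incorrect, precision, recall = human_coding_bounds(13452, 6636, 50, 13, 5059)
print(round(incorrect), round(precision * 100), round(recall * 100))
# prints 1725 87 73
```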

The upper bound for this task, as set by human performance, is thus 73% recall and 87% precision. The lower bound, based on assigning the WordNet sense with the greatest prior probability, is 38% recall and 62% precision.

5  Mapping  Strategies 

Recent work (Van Halteren et al., 1998) has demonstrated improvement in part-of-speech tagging when the outputs of multiple taggers are combined. When the errors of multiple classifiers are not significantly correlated, the result of combining votes from a set of individual classifiers often outperforms the best result from any single classifier. Using a voting strategy seems especially appropriate here: The measures outlined in Section 3 average only 41% recall on the training set, but the senses picked out by their highest values vary significantly.

The investigations undertaken used both simple and aggregate voters, combined using various voting strategies. The simple voters were the 7 measures previously introduced.⁷ In addition, three aggregate voters were generated: (1) the product of the simple measures (smoothed so that zero values wouldn’t offset all other measures); (2) the weighted sum of the simple measures, with weights representing the percentage of the training set assignments correctly identified by the highest score of the simple probabilities; and (3) the maximum score of the simple measures.
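A minimal sketch of the three aggregate voters for one candidate sense follows. The paper does not say how the product was smoothed, so the small floor value used here is an assumption, as are the measure names, weights, and scores in the toy input.

```python
import math


def aggregate_scores(scores, weights, floor=1e-3):
    """Combine simple-measure scores for one candidate sense.

    scores:  dict measure name -> score for this sense
    weights: dict measure name -> weight (e.g. the share of training-set
             assignments the measure's top score identified correctly)
    floor:   assumed smoothing value, so that a single zero score does
             not wipe out the whole product
    """
    product = math.prod(max(v, floor) for v in scores.values())
    weighted_sum = sum(weights[m] * v for m, v in scores.items())
    maximum = max(scores.values())
    return {"product": product, "weighted_sum": weighted_sum, "maximum": maximum}


# Three hypothetical measures; the second has no evidence for this sense.
scores = {"prior": 0.5, "grid": 0.0, "sem_sim": 0.2}
weights = {"prior": 0.4, "grid": 0.3, "sem_sim": 0.3}
print(aggregate_scores(scores, weights))
```

With the floor, the zero-valued measure lowers the product sharply but leaves it non-zero, so the other two measures still influence the ranking of senses.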

Using these data, two different types of voting schemes were investigated. The schemes differ most significantly on the circumstances under which a voter casts its vote for a WordNet sense, the size of the vote cast by each voter, and the circumstances under which a WordNet sense was selected. We will refer to these two schemes as Majority Voting Scheme and Threshold Voting Scheme.

5.1  Majority  Voting  Scheme 

⁷Only 6 measures (including the semantic similarity measure) were set out in the earlier section; the measures total 7 because Indv frame probability is used in two different ways.

Although we do not know in advance how many WordNet senses should be assigned to an entry in the lexical database, we assume that, in general, there is at least one. In line with this intuition, one strategy we investigated was to have both simple and aggregate measures cast a vote for whichever sense(s) of a verb in a semantic class received the highest (non-zero) value for that measure. Ten variations are given here:

• PriorProb: Prior Probability of WordNet senses

• SemSim: Semantic Similarity

• SimpleProd: Product of all simple measures

• SimpleWtdSum: Weighted sum of all simple measures

• MajSimpleSgl: Majority vote of all (7) simple voters

• MajSimplePair: Majority vote of all (21) pairs of simple voters⁸

• MajAggr: Majority vote of SimpleProd and SimpleWtdSum

• Maj3Best: Majority vote of SemSim, SimpleProd, and SimpleWtdSum

• MajSgl+Aggr: Majority vote of MajSimpleSgl and MajAggr

• MajPair+Aggr: Majority vote of MajSimplePair and MajAggr
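The voting step shared by these variations can be sketched as follows. Each voter backs the sense or senses with its highest non-zero score, as described above. The number of votes a sense needs in order to be selected is not stated in the text, so it is exposed here as a parameter whose default (more than half of the voters) is an assumption; the scores are toy values.

```python
from collections import Counter


def top_senses(scores):
    """Sense(s) with the highest non-zero score for one voter (ties kept)."""
    best = max(scores.values(), default=0.0)
    if best <= 0:
        return set()  # a voter with no positive score casts no vote
    return {sense for sense, value in scores.items() if value == best}


def majority_vote(voter_scores, min_votes=None):
    """Select senses backed by at least min_votes voters.

    voter_scores: dict voter name -> dict sense -> score
    min_votes:    assumed selection rule; defaults to a strict majority
    """
    if min_votes is None:
        min_votes = len(voter_scores) // 2 + 1
    votes = Counter()
    for scores in voter_scores.values():
        votes.update(top_senses(scores))
    return {sense for sense, n in votes.items() if n >= min_votes}


voters = {
    "PriorProb": {"s1": 0.6, "s2": 0.4},
    "SemSim": {"s1": 0.2, "s2": 0.7},
    "SimpleProd": {"s1": 0.0, "s2": 0.3},
}
print(majority_vote(voters))  # prints {'s2'}
```

Lowering `min_votes` trades precision for recall, since more senses clear the bar.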

Table 2 gives recall and precision measures for all variations of this voting scheme, both with and without enforcement of the same-synset assumption. If we use the product of recall and precision as a criterion for comparing results, the best voting scheme is MajAggr, with 58% recall and 72% precision without enforcement of the same-synset assumption. Note that if the same-synset assumption is correct, the drop in precision that accompanies its enforcement mostly reflects inconsistencies in human judgments in the training set; the true precision value for MajAggr after enforcing the same-synset assumption is probably close to 67%.

⁸A pair cast a vote for a sense if, among all the senses of a verb, a specific sense had the highest value for both measures.

Variation        w/o ss         w/ss
                 R      P       R      P
PriorProb        38%    62%     45%    46%
SemSim           56%    71%     60%    55%
SimpleProd       51%    74%     57%    55%
SimpleWtdSum     53%    77%     58%    56%
MajSimpleSgl     23%    71%     30%    48%
MajSimplePair    38%    60%     45%    43%
MajAggr          58%    72%     63%    53%
Maj3Best         52%    78%     57%    57%
MajSgl+Aggr      44%    74%     50%    54%
MajPair+Aggr     49%    77%     55%    57%

Table 2: Recall (R) and Precision (P) for Majority Voting Scheme, With and Without the Same-Synset Assumption

Of the simple voters, only PriorProb and SemSim are individually strong enough to warrant discussion. Although PriorProb was used to establish our lower bound, SemSim proves to be the stronger voter, bested only by MajAggr (the majority vote of SimpleProd and SimpleWtdSum) in voting that enforces the same-synset assumption. Both PriorProb and SemSim provide better results than the majority vote of all 7 simple voters (MajSimpleSgl) and the majority vote of all 21 pairs of simple voters (MajSimplePair). Moreover, the inclusion of MajSimpleSgl and MajSimplePair in a majority vote with MajAggr (in MajSgl+Aggr and MajPair+Aggr, respectively) turns in poorer results than MajAggr alone.
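The comparison criterion used for the majority voting results, the product of recall and precision, can be recomputed directly from the figures reported in Table 2. The dictionary layout and function name below are illustrative only.

```python
# variation: (R without ss, P without ss, R with ss, P with ss), in percent
TABLE2 = {
    "PriorProb": (38, 62, 45, 46),
    "SemSim": (56, 71, 60, 55),
    "SimpleProd": (51, 74, 57, 55),
    "SimpleWtdSum": (53, 77, 58, 56),
    "MajSimpleSgl": (23, 71, 30, 48),
    "MajSimplePair": (38, 60, 45, 43),
    "MajAggr": (58, 72, 63, 53),
    "Maj3Best": (52, 78, 57, 57),
    "MajSgl+Aggr": (44, 74, 50, 54),
    "MajPair+Aggr": (49, 77, 55, 57),
}


def rank_by_product(table, with_ss=False):
    """Rank variations by recall x precision, highest first."""
    offset = 2 if with_ss else 0
    products = {
        name: row[offset] * row[offset + 1] / 10000
        for name, row in table.items()
    }
    return sorted(products.items(), key=lambda item: item[1], reverse=True)


print(rank_by_product(TABLE2)[0])                # ('MajAggr', 0.4176)
print(rank_by_product(TABLE2, with_ss=True)[0])  # ('MajAggr', 0.3339)
```

Under this criterion MajAggr ranks first both with and without enforcement of the same-synset assumption.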

The poor performance of MajSimpleSgl and MajSimplePair does not point, however, to a general failure of the principle that multiple voters are better than individual voters. SimpleProd, the product of all simple measures, and SimpleWtdSum, the weighted sum of all simple measures, provide reasonably strong results, and a majority vote of both of them (MajAggr) gives the best results of all. When they are joined by SemSim in Maj3Best, they continue to provide good results.

The  bottom  line  is  that  SemSim  makes  the  most 
significant  contribution  of  any  single  simple  voter, 
while  the  product  and  weighted  sums  of  all  simple 
voters,  in  concert  with  each  other,  provide  the  best 
results  of  all. 

5.2  Threshold  Voting  Scheme 

The second voting strategy first identified, for each simple and aggregate measure, the threshold value at which the product of recall and precision scores in the training set has the highest value if that threshold is used to select WordNet senses. During the voting, if a WordNet sense has a higher score for a measure than its threshold, the measure votes for the sense; otherwise, it votes against it. The weight of the measure’s vote is the precision-recall product at the threshold. This voting strategy has the advantage of taking into account each individual attribute’s strength of prediction.

Variation      R      P
AutoMap+       61%    54%
AutoMap-       61%    54%
Triples        63%    52%
Combo          53%    44%
Combo&Auto     59%    45%

Table 3: Recall (R) and Precision (P) for Threshold Voting Scheme
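The basic threshold voting procedure can be sketched in two steps: choose each measure's threshold on training data, then let the measures cast weighted votes for or against a sense. The function names, the candidate-threshold grid, and the toy scores are assumptions; the variations in the paper add details such as abstention that are not modelled here.

```python
def precision_recall(selected, gold):
    """Precision and recall of a selected set against the gold set."""
    if not selected or not gold:
        return 0.0, 0.0
    hits = len(selected & gold)
    return hits / len(selected), hits / len(gold)


def best_threshold(scored, gold):
    """Threshold maximising precision x recall for one measure.

    scored: dict candidate sense -> score on the training set
    gold:   set of candidate senses that are correct
    Returns (threshold, weight), where weight is the product at the threshold.
    """
    best_t, best_w = 0.0, -1.0
    # Assumed candidate grid: zero plus every observed score.
    for t in sorted({0.0, *scored.values()}):
        selected = {c for c, v in scored.items() if v > t}
        p, r = precision_recall(selected, gold)
        if p * r > best_w:
            best_t, best_w = t, p * r
    return best_t, best_w


def threshold_vote(sense_scores, thresholds, vote_threshold=0.0):
    """Weighted vote for one sense; returns (selected?, vote total)."""
    total = 0.0
    for measure, score in sense_scores.items():
        t, weight = thresholds[measure]
        total += weight if score > t else -weight
    return total > vote_threshold, total


training = {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.1}
print(best_threshold(training, gold={"a", "b"}))  # prints (0.3, 1.0)
```

Because a measure's vote is weighted by its own precision-recall product, a measure that separates correct from incorrect senses poorly on the training set has little influence on the total.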

Five variations on this basic voting scheme were investigated. In each, senses were selected if their vote total exceeded a variation-specific threshold. Table 3 summarizes recall and precision for these variations at their optimal vote thresholds.

In the AutoMap+ variation, Grid and Levin+ probabilities abstain from voting when their values are zero (a common occurrence, because of data sparsity in the training set); the same-synset assumption is automatically implemented. AutoMap- differs in that it disregards the Grid and Levin+ probabilities completely. The Triples variation places the simple and composite measures into three groups: the three with the highest weights, the three with the lowest weights, and the middle or remaining three. Voting first occurs within the group, and the group’s vote is brought forward with a weight equaling the sum of the group members’ weights. This variation also adds to the vote total if the sense was assigned in the training data. The Combo variation is like Triples, but rather than using the weights and thresholds calculated for the single measures from the training data, this variation calculates weights and thresholds for combinations of two, three, four, five, six, and seven measures. Finally, the Combo&Auto variation adds the same-synset assumption to the previous variation.


Although not evident in Table 3 because of rounding, AutoMap- has slightly higher values for both recall and precision than does AutoMap+, giving it the highest recall-precision product of the threshold voting schemes. This suggests that the Grid and Levin+ probabilities could profitably be dropped from further use.

Of the more exotic voting variations, Triples voting achieved results nearly as good as the AutoMap voting schemes, but the Combo schemes fell short, indicating that weights and thresholds are better based on single measures than combinations of measures.

6  Conclusions and Future Work

The voting schemes still leave room for improvement, as the best results (58% recall and 72% precision, or, optimistically, 63% recall and 67% precision) fall shy of the upper bound of 73% recall and 87% precision for human coding.⁹ At the same time, these results are far better than the lower bound of 38% recall and 62% precision for the most frequent WordNet sense.

As has been true in many other evaluation studies, the best results come from combining classifiers (MajAggr): not only does this variation use a majority voting scheme, but more importantly, the two voters take into account all of the simple voters, in different ways. The next-best results come from Maj3Best, in which the three best single measures vote. We should note, however, that the single best measure, the semantic similarity measure from SemSim, lags only slightly behind the two best voting schemes.

This research demonstrates that credible word sense disambiguation results can be achieved without recourse to contextual data. Lexical resources enriched with, for example, syntactic information, in which some portion of the resource is hand-mapped to another lexical resource, may be rich enough to support such a task. The degree of success achieved here also owes much to the confluence of WordNet’s hierarchical structure and SEMCOR tagging, as used in the computation of the semantic similarity measure, on the one hand, and the classified structure of the verb lexicon, which provided the underlying groupings used in that measure, on the other hand. Even where one measure yields good results, several data sources needed to be combined to enable its success.

⁹The criteria for the majority voting schemes preclude their assigning more than 2 senses to any single database entry. Controlled relaxation of these criteria may achieve somewhat better results.